Use of machine learning to identify risk factors for coronary artery disease

Coronary artery disease (CAD) is the leading cause of death in both developed and developing nations. The objective of this study was to identify risk factors for coronary artery disease through machine learning and to assess this methodology. A retrospective, cross-sectional cohort study using the publicly available National Health and Nutrition Examination Survey (NHANES) was conducted in patients who completed the demographic, dietary, exercise, and mental health questionnaires and had laboratory and physical exam data. Univariate logistic models, with CAD as the outcome, were used to identify covariates that were associated with CAD. Covariates that had a p<0.0001 on univariate analysis were included within the final machine-learning model. The machine-learning model XGBoost was used due to its prevalence within the literature as well as its increased predictive accuracy in healthcare prediction. Model covariates were ranked according to the Cover statistic to identify risk factors for CAD. Shapley Additive Explanations (SHAP) were utilized to visualize the relationship between these potential risk factors and CAD. Of the 7,929 patients who met the inclusion criteria in this study, 4,055 (51%) were female and 3,874 (49%) were male. The mean age was 49.2 (SD = 18.4), with 2,885 (36%) White patients, 2,144 (27%) Black patients, 1,639 (21%) Hispanic patients, and 1,261 (16%) patients of other race. A total of 338 (4.5%) patients had coronary artery disease. The selected covariates were fitted in the XGBoost model, which achieved AUROC = 0.89, Sensitivity = 0.85, and Specificity = 0.87 (Fig 1). The top four highest ranked features by cover, a measure of the percentage contribution of the covariate to the overall model prediction, were age (Cover = 21.1%), platelet count (Cover = 5.1%), family history of heart disease (Cover = 4.8%), and total cholesterol (Cover = 4.1%).
Machine learning models can effectively predict coronary artery disease using demographic, laboratory, physical exam, and lifestyle covariates and identify key risk factors.


Introduction
Coronary artery disease (CAD) is the leading cause of death in both developed and developing nations [1]. CAD is an atherosclerotic disease that is associated with major complications, including angina, myocardial infarction, and sudden cardiac death [2][3][4][5]. Due to the high prevalence, morbidity, and mortality of CAD, identification of risk factors is a public health priority [6]. Genome-wide association studies have identified several genetic variants linked to CAD [7][8][9][10]. Additionally, epidemiological studies have identified significant socioeconomic, race, and sex disparities in CAD prevalence, quality measures, and outcomes [1,11,12,13]. Further work has found that a combination of genetic, demographic, and environmental factors contributes to the severity of CAD and other cardiovascular diseases [1,14,15,16,17]. Furthermore, lifestyle factors, such as diet and exercise, have been found to play an important role in the risk for CAD and other cardiovascular diseases [6,18,19,20]. These findings have been combined to develop joint risk scores that factor in both physiological covariates (blood pressure, cholesterol) and demographic covariates (age, race, gender) [5,8,9,21]. Despite the extensive literature on the risk factors for CAD, most studies are hypothesis-testing or epidemiological studies that focus on specific risk factors of interest [22][23][24]. While CAD is recognized as being of "multifactorial" cause, little is known regarding the relative predictive power of different risk factors (lifestyle vs genetic vs chronic disease comorbidities).
Given these limitations in the literature, we leveraged transparent machine-learning methods, including Shapley Additive Explanations (SHAP) and model gain statistics, to identify pertinent risk factors for CAD and compute their relative contribution to model prediction of CAD risk. The NHANES 2017-2020 cohort, a large, nationally representative sample of US adults, was used within this study.

Methods
A retrospective, cross-sectional cohort study using the publicly available National Health and Nutrition Examination Survey (NHANES) was conducted in patients who completed the demographic, dietary, exercise, and mental health questionnaire and had laboratory and physical exam data.

Ethics approval and consent to participate
The acquisition and analysis of the data within this study was approved by the National Center for Health Statistics Ethics Review Board.

Dataset and cohort selection
The National Health and Nutrition Examination Survey (NHANES 2017-2020) is a program designed by the National Center for Health Statistics (NCHS), which has been leveraged to assess the health and nutritional status of the United States population [25]. The NHANES dataset is a series of cross-sectional, complex, multi-stage surveys conducted by the Centers for Disease Control and Prevention (CDC) on a nationally representative cohort of the United States population to provide health, nutritional, and physical activity data. In the present study, we analyzed adult (≥18 years old) patients in the NHANES dataset who completed the demographic, dietary, exercise, and mental health questionnaire and had laboratory and physical exam data.

Assessment of coronary artery disease
The medical conditions file was used to define coronary artery disease. Participants were asked: "Has a doctor or other health professional ever told you that you have coronary heart disease?" Participants who answered "Yes" to this question were considered as having CAD within this study.
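As a minimal sketch of this outcome definition, the snippet below recodes the questionnaire response into a binary CAD indicator. The response codes (1 = Yes, 2 = No, anything else such as refused or unknown) and the function name `cad_indicator` are assumptions made for illustration and should be checked against the NHANES medical conditions codebook. Treating answers other than "Yes" or "No" as missing is also an assumption, since the text only states how "Yes" answers were handled.

```python
from typing import Optional

# Assumed response coding for the coronary heart disease question;
# verify against the NHANES codebook before use.
YES, NO = 1, 2


def cad_indicator(response_code: Optional[int]) -> Optional[int]:
    """Return 1 for "Yes", 0 for "No", and None for any other answer."""
    if response_code == YES:
        return 1
    if response_code == NO:
        return 0
    return None  # refused, unknown, or not asked: outcome treated as missing


# Hypothetical respondent IDs mapped to raw response codes.
responses = {101: 1, 102: 2, 103: 9, 104: None}
cad_by_id = {rid: cad_indicator(code) for rid, code in responses.items()}
```

Keeping unknown answers as missing, rather than folding them into "No", avoids silently labeling unclassifiable participants as disease-free.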

Independent variables
Potential model covariates were identified within the demographics, dietary, physical examination, laboratory, and medical questionnaire datasets in NHANES. All covariates were extracted and merged with the CAD indicator.
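The merge step can be sketched as an inner join on a respondent identifier, so that only participants present in every component and with a known CAD status are retained, which mirrors the inclusion criteria described earlier. The table contents, the covariate names (`age_years`, `total_cholesterol`), and the helper `merge_with_outcome` are hypothetical and are not taken from the study's code.

```python
from typing import Dict, Optional

Row = Dict[str, float]


def merge_with_outcome(cad_by_id: Dict[int, Optional[int]],
                       *tables: Dict[int, Row]) -> Dict[int, Row]:
    """Inner-join covariate tables with the CAD indicator on respondent ID."""
    merged: Dict[int, Row] = {}
    for rid, cad in cad_by_id.items():
        if cad is None:
            continue  # outcome unknown, cannot be used for modeling
        if not all(rid in table for table in tables):
            continue  # respondent missing from at least one component
        row: Row = {"cad": float(cad)}
        for table in tables:
            row.update(table[rid])
        merged[rid] = row
    return merged


# Hypothetical component tables keyed by respondent ID.
demographics = {1: {"age_years": 63.0}, 2: {"age_years": 41.0}, 3: {"age_years": 55.0}}
laboratory = {1: {"total_cholesterol": 212.0}, 3: {"total_cholesterol": 188.0}}
cad_by_id = {1: 1, 2: 0, 3: None}

analysis_rows = merge_with_outcome(cad_by_id, demographics, laboratory)
```

With tabular files, the same join is typically expressed as a data-frame merge on the respondent identifier; the dictionary form is used here only to keep the sketch dependency-free.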

Model construction and statistical analysis
Univariate logistic models, with CAD as the outcome, were used to identify covariates that were associated with CAD. Covariates that had a p<0.0001 on univariate analysis were included within the final machine-learning model. The machine-learning model XGBoost was used due to its prevalence within the literature as well as its increased predictive accuracy in healthcare prediction. XGBoost models were fit with an 80:20 train:test split, and model accuracy statistics (AUROC, Sensitivity, Specificity, F1, Balanced Accuracy) were computed. Model covariates were ranked according to the Gain, Cover, and Frequency statistics, which represent the relative contribution ("model importance") of each covariate, to identify risk factors for CAD. The Gain statistic represents the proportion of the total improvement in model fit that is attributable to splits on a given covariate. The Cover statistic represents the relative number of observations affected by splits on a covariate, and the Frequency statistic represents the relative number of times a covariate is used in splits across the trees of the model. SHAP explanations were utilized to visualize the relationship between these potential risk factors and CAD. Table 1 shows that of the 7,929 patients who met the inclusion criteria in this study, 4,055 (51%) were female and 3,874 (49%) were male. The mean age was 49.2 (SD = 18.4). Table 2 shows the top four highest ranked features by cover, a measure of the percentage contribution of the covariate to the overall model prediction, which were age (Cover = 21.1%), platelet count (Cover = 5.1%), family history of heart disease (Cover = 4.8%), and total cholesterol (Cover = 4.1%).
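The screening and evaluation steps above can be sketched as follows. This is a dependency-free illustration, not the study's code: the univariate logistic model is fitted by Newton-Raphson with a two-sided Wald test on the slope (the text does not state which test produced the p-values, so this is an assumption), and the accuracy statistics are computed from predicted labels and scores. All function names and the toy data are hypothetical, and the fit assumes roughly standardized covariates without perfect separation.

```python
import math


def wald_p_value(x, y, max_iter=50, tol=1e-10):
    """Fit logit P(y=1) = b0 + b1*x and return the two-sided Wald p-value for b1."""
    b0 = b1 = 0.0
    for _ in range(max_iter):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
            w = p * (1.0 - p)
            g0 += yi - p           # gradient with respect to the intercept
            g1 += (yi - p) * xi    # gradient with respect to the slope
            h00 += w               # entries of the information matrix
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        d0 = (h11 * g0 - h01 * g1) / det   # Newton step: inverse information times gradient
        d1 = (h00 * g1 - h01 * g0) / det
        b0, b1 = b0 + d0, b1 + d1
        if abs(d0) + abs(d1) < tol:
            break
    se_b1 = math.sqrt(h00 / det)           # standard error of the slope
    return math.erfc(abs(b1 / se_b1) / math.sqrt(2.0))


def screen_covariates(covariates, y, alpha=1e-4):
    """Keep covariates whose univariate p-value is below alpha."""
    return [name for name, x in covariates.items() if wald_p_value(x, y) < alpha]


def classification_metrics(y_true, y_pred):
    """Sensitivity, specificity, balanced accuracy, and F1 from binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "balanced_accuracy": (sensitivity + specificity) / 2.0,
        "f1": 2 * tp / (2 * tp + fp + fn),
    }


def auroc(y_true, scores):
    """Probability that a random case outscores a random non-case (ties count 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data: "signal" is strongly associated with the outcome, "noise" is not.
y = [1] * 6 + [0] * 94 + [1] * 50 + [0] * 50
covariates = {
    "signal": [0] * 100 + [1] * 100,
    "noise": [1] * 3 + [0] * 3 + [1] * 47 + [0] * 47 + [1] * 25 + [0] * 25 + [1] * 25 + [0] * 25,
}
selected = screen_covariates(covariates, y)
```

In this sketch only covariates passing the p<0.0001 threshold would be passed on to the gradient-boosted model, and the metric functions would be applied to predictions on the held-out 20% of participants.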

Results
SHAP visualization was used to interpret the top four covariates. In Fig 2, we observed that age had a sigmoidal relationship with risk for coronary artery disease. Figs 3, 4a, and 4b show the SHAP explanations for the other features examined. At ages between 20 and 35, there was no significant change in risk for CAD with increasing age; between 35 and 70 years of age, there was a significant increase in risk for CAD with increasing age; and above 70 years of age, there was no significant further increase in risk for CAD with increasing age. Additionally, a curvilinear relationship was observed between total cholesterol and risk for CAD. Patients with significantly decreased total cholesterol were observed to have increased risk for heart disease, and patients with increased cholesterol were also observed to be at increased risk, with a minimum risk around 200 mg/dL of cholesterol. A curvilinear relationship was also observed between platelet count and risk for CAD, with both significantly decreased and significantly increased platelet counts linked with CAD and a minimum risk observed around 300,000 cells/uL. Family history was also a significant predictor for CAD. Patients with close relatives who had had a heart attack were at significantly increased risk for CAD.
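To illustrate what a SHAP value measures, the sketch below computes exact Shapley values for a single patient by enumerating all feature subsets, with features outside a subset set to a reference (baseline) patient. This brute-force version is only an illustration of the definition; practical SHAP analyses of tree models rely on much faster specialized algorithms. The scoring function `toy_risk_score`, its coefficients, and the patient values are hypothetical and are not the fitted model from this study.

```python
from itertools import combinations
from math import factorial


def shapley_values(f, x, baseline):
    """Exact Shapley values of f at x, relative to a single baseline point."""
    n = len(x)
    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for k in range(n):  # subset sizes 0 .. n-1
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                # Features in the subset take the patient's values, the rest the baseline's.
                without_i = [x[j] if j in subset else baseline[j] for j in range(n)]
                with_i = list(without_i)
                with_i[i] = x[i]
                phi[i] += weight * (f(with_i) - f(without_i))
    return phi


def toy_risk_score(v):
    """Hypothetical score: linear in age, U-shaped in cholesterol, age x family-history interaction."""
    age, cholesterol, family_history = v
    return (0.03 * age
            + 0.00005 * (cholesterol - 200.0) ** 2
            + 0.5 * family_history
            + 0.01 * age * family_history)


patient = [70.0, 260.0, 1.0]     # age, total cholesterol (mg/dL), family history (0/1)
reference = [50.0, 200.0, 0.0]   # hypothetical baseline patient
phi_age, phi_chol, phi_fh = shapley_values(toy_risk_score, patient, reference)
```

The three contributions sum to the difference between the patient's score and the reference score, which is the additivity property that allows per-feature SHAP values to be plotted against feature values as in the figures described above.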

Discussion
In this retrospective, cross-sectional cohort of United States adults, a machine learning model utilizing demographic, laboratory, physical examination, and lifestyle questionnaire data had strong predictive accuracy (AUROC = 0.89). The greatest predictors for coronary artery disease included age, total cholesterol, total platelets, and family history of a heart attack. The visualizations completed for the top four covariates were concordant with current literature around the relationship between these covariates and coronary artery disease: there is strong epidemiological and physiological evidence for increased age and cholesterol as major risk factors for coronary artery disease [26][27][28]. The non-linear relationship between cholesterol and coronary artery disease matches survival-modeling and restricted cubic spline analysis from other studies [28][29][30][31][32][33][34][35][36]. Furthermore, multiple genetic and sociological studies have found that family history is a significant risk factor for coronary artery disease [37][38][39][40]. Additionally, the association between low platelet counts and coronary artery disease is consistent with reports linking pathology such as thrombocytopenia to coronary artery disease [41][42][43]. In addition to the top four covariates within our model, we also wanted to explore whether the machine-learning model was able to generate predictions for HDL-cholesterol and systolic blood pressure, two major risk factors that have been widely studied within the cardiovascular literature. In these visualizations, we observed a strong negative relationship between HDL-cholesterol and risk for coronary artery disease (Fig 4a). We observed a curvilinear relationship between systolic blood pressure and coronary heart disease, with blood pressures lower than 120 being associated with increased risk for coronary heart disease and blood pressures above this level being strongly associated with coronary heart disease as well.
Since the visualizations for these risk factors match the relationships reported in the literature, we have increased confidence that the machine learning model is able to capture the actual physiological relationships of these covariates [44][45][46]. These transparent machine-learning tools allow for increased confidence that these algorithms are picking up true signal within these covariates to predict coronary artery disease rather than just replicating potential biases stemming from systemic data-quality errors that are present within the dataset. Additionally, these SHAP visualizations allow us to interpret that the increased predictive power of these machine-learning methods is associated with the ability of these non-parametric methods to more accurately capture the non-linear, interactive relationships between the covariates, rather than just over-fitting the model to get increased accuracy.
The greatest strength of this algorithmic method for identification of the covariates is the ability to search through hundreds of covariates systematically without relying upon judgment from the researcher, which may be muddled by potential personal biases. This method also allows for the ranking of the relative importance of each of these covariates through the cover statistic, which gives the relative contribution of each covariate to the prediction and, from there, an estimate of each covariate's relative contribution to a patient's true risk for coronary artery disease. Another strength is that after these covariates are selected and the model is built, SHAP visualizations can be used to check that each covariate matches the current literature understanding of its association with coronary heart disease or, in the case of a discrepancy, to allow researchers to assess the plausibility of the feature and then evaluate for potential errors in data quality.
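The ranking described in this paragraph can be illustrated with a small sketch that turns per-split records into relative importance shares. It assumes the share-of-total definitions commonly used in gradient-boosting importance tables (each statistic summed over a covariate's splits and divided by the total over all covariates); the split records and the helper `importance_shares` are hypothetical and do not reproduce the fitted model.

```python
from collections import defaultdict


def importance_shares(splits):
    """Aggregate (feature, gain, cover) split records into relative shares per feature."""
    totals = defaultdict(lambda: {"gain": 0.0, "cover": 0.0, "frequency": 0.0})
    for feature, gain, cover in splits:
        totals[feature]["gain"] += gain        # loss improvement from this split
        totals[feature]["cover"] += cover      # observations reaching this split
        totals[feature]["frequency"] += 1.0    # one more split on this feature
    grand = {key: sum(t[key] for t in totals.values())
             for key in ("gain", "cover", "frequency")}
    return {feature: {key: t[key] / grand[key] for key in grand}
            for feature, t in totals.items()}


# Hypothetical split records pooled over all trees: (feature, gain, cover).
splits = [
    ("age", 40.0, 800.0),
    ("age", 20.0, 300.0),
    ("platelet_count", 25.0, 500.0),
    ("total_cholesterol", 15.0, 400.0),
]
shares = importance_shares(splits)
ranked_by_cover = sorted(shares, key=lambda name: shares[name]["cover"], reverse=True)
```

Because every statistic is expressed as a share of the total, the values for all covariates sum to one, which is what allows them to be read as percentage contributions when covariates are ranked.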
One potential weakness of this machine-learning analysis is the retrospective nature of the cohort. The covariates that were selected within this study will be better at predicting coronary heart disease risk for this cohort than for other cohorts. However, this limitation was mitigated by the use of separate training and testing sets to minimize the errors that come with overfitting. Furthermore, SHAP visualizations allow researchers to test the physiologic plausibility of each of these covariates and to assess whether these effects are due to true signal or are just noise that may be contributing to a type-1 error.
Given the analysis of the strengths and weaknesses of these methods, we argue that use of machine-learning methods can be an effective first step in the identification of risk-factors that can then be further selected by clinicians based upon the specific clinical presentation.

Limitations
This study has several strengths and weaknesses. We utilized the NHANES dataset, which is a retrospective cohort, carrying the limitations of retrospective studies. However, this dataset allows for the selection of a large cohort and evaluation of data quality and, due to the publicly available nature of the cohort, allows for increased replication and follow-up studies based upon the same cohort. Furthermore, the cohort relied on surveys to obtain the outcome of interest (CAD) as well as the dietary and lifestyle information. More accurate measurements may have been achieved with prospective studies with automated measurement of foods. However, self-reported survey information allows a large volume of participants to be included within this study. Another weakness was the voluntary nature of this cohort, with participants choosing to opt into the study instead of being randomly selected. This may artificially select a cohort that differs significantly from the population. However, our analysis found a demographically diverse population, so these results may still be generalizable to other cohorts.

Conclusion
Machine learning models can effectively predict coronary artery disease using demographic, laboratory, physical exam, and lifestyle covariates. Age, total cholesterol, total platelets, and family history of heart attack are the strongest predictors of coronary artery disease.